Abstract

Given a rewriteable text T of length n on an alphabet of size $\sigma$, we propose an online algorithm that
computes the sparse suffix array and the sparse longest common prefix array of T in
$O(|C| \sqrt{\lg n} + m \lg m \lg n \lg^* n)$ time by using the text space and O(m) additional working space,
where m is the number of positions in a set $V \subseteq [1..n]$ that are provided online and in arbitrary order,
and $C = \bigcup_{p, p' \in V, p \neq p'} [p..p + \mathrm{lcp}(T[p..], T[p'..])]$.


1 Introduction 

Sorting suffixes of a long text lexicographically is an important first step for many text processing
algorithms [T3]. The complexity of the problem is quite well understood: for integer alphabets, suffix sorting
can be done in optimal linear time [10], and also almost in-place [12]. In this article, we consider a variant of
the problem: instead of computing the order of every suffix, we address the sparse suffix sorting problem.
Given a text T[1..n] of length n and a set $V \subseteq [1..n]$ of m arbitrary positions in T, the problem asks for the
(lexicographic) order of the suffixes starting at the positions in V. The answer is encoded by a permutation
of V, which is called the sparse suffix array (SSA) of T (with respect to V).

Like the “full” suffix arrays, we can enhance SSA(T, V) by the length of the longest common prefix (LCP)
between adjacent suffixes in SSA(T, V), which we call the sparse longest common prefix array (SLCP).
In combination, SSA(T, V) and SLCP(T, V) store the same information as the sparse suffix tree, i.e., they
implicitly represent a compacted trie over all suffixes starting at the positions in V. This allows us to use
the SSA as an efficient index for pattern matching, for example.
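
To make the two arrays concrete, the following minimal sketch computes them by brute force. The helper names `lcp` and `sparse_suffix_arrays` are chosen here for illustration and do not come from the source; positions are 1-based as in the text. The quadratic approach only pins down the definitions and is not the algorithm proposed in this article.

```python
def lcp(x, y):
    """Length of the longest common prefix of two strings."""
    length = 0
    for a, b in zip(x, y):
        if a != b:
            break
        length += 1
    return length


def sparse_suffix_arrays(text, positions):
    """Return (SSA, SLCP) for 1-based `positions`, by brute force."""
    # Sort the chosen positions by the suffix that starts there.
    ssa = sorted(positions, key=lambda p: text[p - 1:])
    # SLCP[k] is the LCP of the suffixes starting at SSA[k] and SSA[k + 1].
    slcp = [lcp(text[ssa[k] - 1:], text[ssa[k + 1] - 1:])
            for k in range(len(ssa) - 1)]
    return ssa, slcp


ssa, slcp = sparse_suffix_arrays("banana", [2, 4, 6])
print(ssa, slcp)
```

For T = banana and the positions 2, 4 and 6, the selected suffixes are anana, ana and a, so the sketch yields SSA = [6, 4, 2] and SLCP = [1, 3].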

Based on classic suffix array construction algorithms [iniiii], sparse suffix sorting is easily conducted
in O(n) time if O(n) additional working space is available. For m = o(n), however, the needed working
space may be too large compared to the final space requirement of SSA(T). Although some special choices
of V (e.g., evenly spaced suffixes or prefix codes) admit space-optimal O(m) construction algorithms [5], the
problem of sorting arbitrary choices of suffixes in small space seems to be much harder. We are aware of the
following results: As a deterministic algorithm, Kärkkäinen et al. [10] gave a trade-off using
$O(\tau m + n \sqrt{\tau})$ time and $O(m + n / \sqrt{\tau})$ working space with a parameter $\tau \in [1, \sqrt{n}]$.
If randomization is allowed, there is a technique based on Karp-Rabin fingerprints, first proposed by
Bille et al. [3] and later improved by I et al. [5]. The latest one works in O(n lg n) expected time and
O(m) additional space.

1.1 Computational Model 

We assume that the text of length n is loaded into RAM. Our algorithms are allowed to overwrite parts of
the text, as long as they can restore the text into its original form at termination. Apart from this space, we
are only allowed to use O(m) additional words. The positions in V are assumed to arrive online, implying
in particular that they need not be sorted. We aim at worst-case efficient deterministic algorithms.

Our computational model is the word RAM model with word size $\Omega(\lg n)$. Here, characters use
$\lceil \lg \sigma \rceil$ bits, so that $\lg_\sigma n$ characters can be packed into O(1) words, where $\sigma$ is the alphabet size.
Comparing two strings X and Y takes $O(\mathrm{lcp}(X, Y) / \lg_\sigma n)$ time, where lcp(X, Y) denotes the length of the
longest common prefix of X and Y.
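
As a rough illustration of the packing, consider the following worked example. The concrete numbers are assumed here and do not come from the source: $n = 2^{32}$, a word size of exactly $\lg n = 32$ bits, and $\sigma = 4$.

```latex
\[
\lceil \lg \sigma \rceil = 2 \text{ bits per character}, \qquad
\frac{\lg n}{\lceil \lg \sigma \rceil} = \frac{32}{2} = 16 = \lg_\sigma n
\text{ characters per word}.
\]
```

Under these assumptions, two strings sharing a prefix of 1000 characters are compared with about 63 word comparisons instead of 1000 character comparisons.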

1.2 Algorithm Outline and Our Results 

Our main algorithmic idea is to insert the suffixes starting at positions of V into a self-balancing binary search
tree [S]; since each insertion invokes O(lg m) suffix-to-suffix comparisons, the time complexity is $O(t_S m \lg m)$,
where $t_S$ is the cost of each suffix-to-suffix comparison. If all suffix-to-suffix comparisons are conducted by
naively comparing the characters, the resulting worst-case time complexity is O(nm lg m). In order to speed
this up, our algorithm identifies large identical substrings at different positions during different
suffix-to-suffix comparisons. Instead of performing naive comparisons on identical parts over and over again, we build
a data structure (stored in redundant text space) that will be used to accelerate subsequent suffix-to-suffix
comparisons. Informally, when two (possibly overlapping) substrings in the text are detected to be the same,
one substring can be overwritten.
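
The insertion-based idea can be sketched as follows. As a simplification that is not part of the source, a sorted Python list with binary search stands in for the self-balancing binary search tree: each insertion still performs O(lg m) suffix-to-suffix comparisons, but the list insertion itself costs O(m) time, and every comparison is the naive character comparison. The function name `insert_suffix` is hypothetical.

```python
def insert_suffix(text, ssa, p):
    """Insert 1-based position p into the sorted list `ssa`.

    Returns the number of suffix-to-suffix comparisons performed.
    """
    new_suffix = text[p - 1:]
    lo, hi = 0, len(ssa)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        # Naive comparison: may scan up to n characters.
        if text[ssa[mid] - 1:] < new_suffix:
            lo = mid + 1
        else:
            hi = mid
    ssa.insert(lo, p)
    return comparisons


ssa = []
for p in [2, 6, 4]:          # positions arrive online and unsorted
    insert_suffix("banana", ssa, p)
print(ssa)
```

The accelerated algorithm keeps this outer loop and replaces only the naive comparison inside it.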

With respect to the question of “how” to accelerate suffix-to-suffix comparisons, we focus on a technique 
that is called edit sensitive parsing (ESP) [5]. Its properties allow us to compute the longest common 
extension (LCE) of any two substrings efficiently. We propose a new variant of ESP, which we call 
hierarchical stable parsing (HSP), in order to build our data structure for LCE queries efficiently in 
text space. 

We make the following definition that allows us to analyse the running time more accurately. Define
$C := \bigcup_{p, p' \in V, p \neq p'} [p..p + \mathrm{lcp}(T[p..], T[p'..])]$ as the set of positions that must be compared for
distinguishing the suffixes from V. Then sparse suffix sorting is trivially lower bounded by $\Omega(|C| / \lg_\sigma n)$ time.
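
A brute-force sketch of this definition, useful only for checking small examples. The function name `distinguishing_positions` is hypothetical; positions are 1-based, and the sketch drops the position n + 1 that arises when one suffix is a prefix of another, which is an assumption made here for convenience.

```python
def distinguishing_positions(text, positions):
    """Return the set C of 1-based text positions, computed by brute force."""
    n = len(text)
    covered = set()
    for p in positions:
        for q in positions:
            if p == q:
                continue
            # l = lcp(T[p..], T[q..]) by naive scanning.
            l = 0
            while (p + l <= n and q + l <= n
                   and text[p + l - 1] == text[q + l - 1]):
                l += 1
            # The interval [p..p+l] includes the first mismatching position.
            covered.update(range(p, min(p + l, n) + 1))
    return covered
```

In `abxxxxcd` with positions 1 and 7, the two suffixes already differ in their first character, so C = {1, 7} and |C| = 2 although n = 8. This is the situation in which a bound depending on |C| rather than n pays off.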

Theorem 1.1. Given a text T of length n that is loaded into RAM, the SSA and SLCP of T for a set of
m arbitrary positions can be computed deterministically in $O(|C| \sqrt{\lg n} + m \lg m \lg n \lg^* n)$ time, using O(m)
additional working space.

Note that the running time may actually be sublinear (excluding the loading cost for the text). To
the best of our knowledge, this is the first algorithm having a worst-case performance guarantee close
to the lower bound. All previously mentioned (deterministic and randomized) algorithms take $\Omega(n)$ time
even if we exclude the loading cost. Also, general string sorters (e.g., forward radix sort [1] or three-way
string quicksort M), which do not take advantage of the suffix overlapping, suffer from the lower bound of
$\Omega(\ell / \lg_\sigma n)$ time, where $\ell$ is the sum of all LCP values in the SLCP, which is always at least |C|, but can in
fact be $\Theta(nm)$.

1.3 Relationship between suffix sorting and LCE queries

The LCE-problem is to preprocess a text T such that subsequent LCE-queries $\mathrm{lce}(i, j) := \mathrm{lcp}(T[i..], T[j..])$
giving the length of the longest common prefix of the suffixes starting at positions i and j can be answered
efficiently. The currently best data structure for LCE is due to Bille et al. H], who proposed a deterministic
algorithm that builds a data structure in $O(n/\tau)$ space and $O(n^{2+\varepsilon})$ time ($\varepsilon > 0$) answering LCE queries in
$O(\tau)$ time, for any $1 \le \tau \le n$.

Data structures for LCE and sparse suffix sorting are closely related, as shown in the following observation: 

Observation 1.2. Given a data structure that computes LCE in $O(\tau)$ time for $\tau > 0$, we can compute sparse
suffix sorting for m positions in $O(\tau m \lg m)$ time (using balanced binary search trees as outlined above).

Conversely, given an algorithm computing the SSA and the SLCP of a text T of length n for m positions
in O(m) space and O(f(n, m)) time for some f, we can construct a data structure in O(f(n, m)) time and
O(m) space, answering LCE queries on T in $O(n^2/m^2)$ time (using a difference cover sampling modulo
$n^2/m^2$).
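
The first direction of the observation can be sketched as follows: one LCE query plus one character comparison decides the order of two suffixes. The names `naive_lce`, `compare_suffixes` and `sort_with_lce` are hypothetical, the scanning oracle is only a stand-in for a real LCE data structure, and Python's built-in `sorted` (which also uses O(m lg m) comparisons) replaces the balanced binary search tree of the text.

```python
from functools import cmp_to_key


def naive_lce(text, i, j):
    """Stand-in LCE oracle for 1-based positions i and j."""
    n, l = len(text), 0
    while i + l <= n and j + l <= n and text[i + l - 1] == text[j + l - 1]:
        l += 1
    return l


def compare_suffixes(text, i, j, lce=naive_lce):
    """Return -1, 0 or 1 as T[i..] is smaller than, equal to or larger than T[j..]."""
    if i == j:
        return 0
    l = lce(text, i, j)
    n = len(text)
    if i + l > n:            # T[i..] is a proper prefix of T[j..]
        return -1
    if j + l > n:            # T[j..] is a proper prefix of T[i..]
        return 1
    return -1 if text[i + l - 1] < text[j + l - 1] else 1


def sort_with_lce(text, positions, lce=naive_lce):
    """Sort 1-based positions by their suffixes using only LCE-based comparisons."""
    key = cmp_to_key(lambda i, j: compare_suffixes(text, i, j, lce))
    return sorted(positions, key=key)
```

Any faster oracle with the same signature can be passed as `lce` without touching the sorting code.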

As a tool for our sparse suffix sorting algorithm, we first develop a data structure for LCE-queries with 
the following properties. 

Theorem 1.3. There is a data structure using $O(n/\tau)$ space that answers LCE queries in
$O(\lg^* n \, (\lg(n/\tau) + \tau^{\lg 3} / \lg_\sigma n))$ time, where $1 \le \tau \le n$. We can build the data structure in
$O(n \, (\lg^* n + (\lg n)/\tau + (\lg \tau)/\lg_\sigma n))$ time with $O(\tau^{\lg 3} \lg^* n)$ additional words during construction.

An advantage of our data structure over the deterministic data structures in ^ is its faster construction
time, which is roughly O(n lg n).

2 Preliminaries 

Let $\Sigma$ be an ordered alphabet of size $\sigma$. We assume that a character in $\Sigma$ is represented by an integer. For
a string $X \in \Sigma^*$, let |X| denote the length of X. For a position i in X, let X[i] denote the i-th character
of X. For positions i and j, let $X[i..j] = X[i] X[i+1] \cdots X[j]$. For W = XYZ with $X, Y, Z \in \Sigma^*$, we call
X, Y and Z a prefix, substring, suffix of W, respectively. In particular, the suffix beginning at position i is
denoted by X[i..].

An interval I = [b..e] is the set of consecutive integers from b to e, for $b \le e$. For an interval I, we use
the notations b(I) and e(I) to denote the beginning and end of I; i.e., I = [b(I)..e(I)]. We write |I| to
denote the length of I; i.e., |I| = e(I) − b(I) + 1.

3 Answering LCE queries with ESP Trees 

Edit sensitive parsing (ESP) and ESP trees were proposed by Cormode and Muthukrishnan [5] to approximate
the edit distance with moves efficiently. Here, we show that they can also be used to answer LCE
queries.

3.1 Edit Sensitive Parsing 

The aim of the ESP technique is to decompose a string $Y \in \Sigma^*$ into substrings of length 2 or 3 such that
each substring of this decomposition is determined uniquely by its neighboring characters. To this end, it
first identifies so-called meta-blocks in Y, and then further refines these meta-blocks into blocks of length 2
or 3.

Figure 1: We parse the string Y = babababbbbbbbabbbbbabababaaaaab. The string is divided into blocks,
and each block gets assigned a new character (represented by the rounded boxes).

The meta-blocks are created in the following 3-stage process (see also Figure 1 for an example):

(1) Identify maximal regions of repeated symbols (i.e., maximal substrings of the form $c^\ell$ for $c \in \Sigma$ and
$\ell \ge 2$). Such substrings form the type 1 meta-blocks.

(2) Identify remaining substrings of length at least 2 (which must lie between two type 1 meta-blocks).
Such substrings form the type 2 meta-blocks.

(3) Any substring not yet covered by a meta-block consists of a single character and cannot have type 2
meta-blocks as its neighbors. Such characters Y[i] are fused with the type 1 meta-block to their right¹
or, if Y[i] is the last character in Y, with the type 1 meta-block to its left. The meta-blocks emerging
from this are called type M (mixed).

Meta-blocks of type 1 and type M are collectively called repeating meta-blocks.

Although meta-blocks are defined by the comprising characters, we treat them as intervals on the text 
range. 
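
The three stages can be sketched as follows. The function name `meta_blocks` and the output format (0-based inclusive intervals tagged '1', '2' or 'M') are choices made for this illustration and are not from the source; the sketch assumes an input of length at least 2.

```python
from itertools import groupby


def meta_blocks(y):
    """Split y (len >= 2) into (start, end, kind) meta-block intervals."""
    # Stage 1: maximal runs of one symbol; runs of length >= 2 are type 1.
    # Neighbouring runs of length 1 are merged into one uncovered gap.
    parts = []
    pos = 0
    for _, group in groupby(y):
        length = len(list(group))
        if length >= 2:
            parts.append([pos, pos + length - 1, '1'])
        elif parts and parts[-1][2] == 'gap':
            parts[-1][1] = pos
        else:
            parts.append([pos, pos, 'gap'])
        pos += length

    result = []
    pending = None  # position of a single character waiting to be fused
    for start, end, kind in parts:
        if kind == 'gap' and end > start:
            result.append((start, end, '2'))      # Stage 2
        elif kind == 'gap':
            pending = start                       # Stage 3, fused below
        elif pending is not None:
            result.append((pending, end, 'M'))    # fuse with right neighbour
            pending = None
        else:
            result.append((start, end, '1'))
    if pending is not None:
        # Last character of y: fuse with the meta-block to its left.
        start, _, _ = result.pop()
        result.append((start, pending, 'M'))
    return result
```

For example, `aabccd` splits into the type 1 meta-block `aa` and the type M meta-block `bccd`: the single `b` is fused with `cc` to its right, and the final `d` with the meta-block to its left.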

Meta-blocks are further partitioned into blocks, each containing two or three characters from $\Sigma$. Blocks
inherit the type of the meta-block they are contained in. How the blocks are partitioned depends on the 
type of the meta-block: 

Repeating meta-blocks. A repeating meta-block is partitioned greedily: create blocks of length three 
until there are at most four, but at least two characters left. If possible, create a single block of length 
2 or 3; otherwise create two blocks, each containing two characters. 

Type-2 meta-blocks. A type 2 meta-block $\mu$ is processed in $O(|\mu| \lg^* \sigma)$ time by a technique called
alphabet reduction [3]. The first $\lg^* \sigma$ characters are blocked in the same way as repeating meta-blocks.
Any remaining block $\beta$ is formed such that $\beta$'s interval boundaries are determined by
$Y[\max(b(\beta) - \Delta_L, b(\mu)) .. \min(e(\beta) + \Delta_R, e(\mu))]$, where $\Delta_L := \lceil \lg^* \sigma \rceil + 5$ and $\Delta_R := 5$ (see [3, Lemma 8]).

We call the substring $Y[b(\beta) - \Delta_L .. e(\beta) + \Delta_R]$ the local surrounding of $\beta$, if it exists. Blocks whose
local surrounding exists are also called surrounded.
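
The two partitioning ingredients can be illustrated with a small sketch. `split_repeating` follows the greedy rule for repeating meta-blocks stated above. `reduce_alphabet_once` shows one round of the deterministic coin tossing step on which alphabet reduction is commonly based; the source only refers to the technique, so the exact form used here (including the treatment of the first label) is an assumption. Both function names are hypothetical.

```python
def split_repeating(length):
    """Block lengths (each 2 or 3) for a repeating meta-block of `length` >= 2."""
    if length < 2:
        raise ValueError("a meta-block has at least two characters")
    blocks = []
    while length > 4:           # keep cutting blocks of length three
        blocks.append(3)
        length -= 3
    # Now 2 <= length <= 4: one block if possible, else two blocks of two.
    blocks.extend([2, 2] if length == 4 else [length])
    return blocks


def reduce_alphabet_once(labels):
    """One reduction round on integer labels with no two adjacent labels equal."""
    reduced = [labels[0] & 1]   # the first label has no left neighbour
    for prev, cur in zip(labels, labels[1:]):
        diff = prev ^ cur
        bit = (diff & -diff).bit_length() - 1   # lowest differing bit
        reduced.append(2 * bit + ((cur >> bit) & 1))
    return reduced
```

After one round, adjacent labels are still distinct, but labels of b bits are replaced by labels smaller than 2b; repeating the round shrinks the alphabet quickly, which is where the $\lg^*$ factor comes from.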

Let $\hat{\Sigma} \subseteq \Sigma^2 \cup \Sigma^3$ denote the set of blocks resulting from ESP (the “new alphabet”). We use
$\mathrm{esp} : \Sigma^* \to \hat{\Sigma}^*$ to denote the function that parses a string by ESP and returns a string in $\hat{\Sigma}^*$.

¹ The original version prefers the left meta-block, but we change it for a more stable behavior (cf. Figure 13).

3.2 Edit Sensitive Parsing Trees


Applying esp recursively on its output generates a context free grammar (CFG) as follows. Let $Y_0 := Y$
be a string on an alphabet $\Sigma_0 := \Sigma$ with $\sigma_0 = |\Sigma_0|$. The output of $Y_h := \mathrm{esp}^{(h)}(Y) = \mathrm{esp}(\mathrm{esp}^{(h-1)}(Y))$ is a
sequence of blocks, which belong to a new alphabet $\Sigma_h$ (h > 0). A block $b \in \Sigma_h$ contains a string in $\Sigma_{h-1}^*$
of length two or three. Since each application of esp at least halves the string length, there is a
$k = O(\lg |Y|)$ such that $\mathrm{esp}(Y_k)$ returns a single block r. We write $V := \bigcup_{1 \le h \le k} \Sigma_h$ for the set of all blocks
in $Y_1, Y_2, \ldots, Y_k$.

We use a (deterministic) dictionary $D : \Sigma_h \to \Sigma_{h-1}^2 \cup \Sigma_{h-1}^3$ to map a block to its characters, for each
$1 \le h \le k$. The dictionary entries are of the form $b \to xy$ or $b \to xyz$, where $b \in \Sigma_h$ and $x, y, z \in \Sigma_{h-1}$.
The CFG for Y is represented by the non-terminals V, the terminals $\Sigma_0$, the dictionary D, and the start
symbol r. This grammar exactly derives Y.

Our representation differs from that of Cormode and Muthukrishnan [5] because it does not use hash
tables.

Figure 2: Given the string Y = aaaaaaaaaaaaaaaabababa, we parse Y with ESP and build ET(Y).
Nodes/characters belonging to the same meta-block are connected by horizontal (repeating meta-block)
or diagonal (type 2 meta-block) lines.


Definition 3.1. The ESP tree ET(Y) of a string Y is a slightly modified derivation tree of the CFG defined
above.² The internal nodes are elements of $V \setminus \Sigma_1$, and the leaves are from $\Sigma_1$. Each leaf refers to a substring
in $\Sigma_0^2$ or $\Sigma_0^3$. Its root node is the start symbol r.

An example is given in Figure 2.

For convenience, we count the height of nodes from 1, so that the sequence of nodes on height h, denoted
by $\langle Y \rangle_h$, corresponds to $Y_h$. The generated substring of a node $\langle Y \rangle_h[i]$ is the substring of Y generated
by the symbol $Y_h[i]$ (applying D recursively on $Y_h[i]$). Each node v represents a block that is contained in a
meta-block $\mu$, for which we say that $\mu$ builds v. More precisely, a node $v := \langle Y \rangle_h[i]$ is said to be built on
a meta-block represented by $\langle Y \rangle_{h-1}[b..e]$ iff $\langle Y \rangle_{h-1}[b..e]$ contains the children of v. Like with blocks, nodes
inherit the type of the meta-block on which they are built. An overview of the above definitions is given in
Figure 3.

Surrounded Nodes. A leaf is called surrounded iff its representing block on text-level is surrounded.
Given an internal node v on height h + 1 ($h \ge 1$) whose children are $\langle Y \rangle_h[\beta]$, we say that v is surrounded
iff the nodes $\langle Y \rangle_h[b(\beta) - \Delta_L .. e(\beta) + \Delta_R]$ are surrounded.

² In the original version, it actually is the derivation tree, but we modify it slightly for our convenience.

Figure 3: Let v be a node on $\langle Y \rangle_h$. The subtree rooted at v is depicted by the white, rounded boxes. The
generated substring of v is the concatenation of the white, bordered blocks on the lowest level in the picture.
The meta-block $\mu$, on which v is built, is highlighted by a horizontal hatching of the nodes on height h − 1
contained in $\mu$.


3.3 Tree Representation 

We store the ESP tree as a CFG. Every non-terminal is represented by a name. The name is a pointer to 
a data-field, which is composed differently for leaves and internal nodes: 

Leaves. A leaf stores a position i and a length $l \in \{2, 3\}$ such that Y[i..i + l − 1] is the generated substring.

Internal nodes. An internal node stores the length of its generated substring, and the names of its children. 
If it has only two children, we use a special, invalid name 0 for the non-existing third child such that 
all data fields are of the same length. 


This representation allows us to navigate top-down in the ESP tree by traversing the tree from the root, in 
time linear in the height of the tree. 

We keep the invariant that the roots of isomorphic subtrees have the same names. In other words, before
creating a new name for the rule $b \to xyz$, we have to check whether there already exists a name for xyz.
To perform this look-up efficiently, we also need the reverse dictionary of D, with the right hand side of the
rules as search keys. We use a dictionary of size O(|Y|), supporting lookup and insert in $O(t_\lambda)$ time.

More precisely, we assume there is a dictionary data structure, storing n elements in O(n) space,
supporting lookup and insert in $O(t_\lambda + l / \lg_\sigma n)$ time for a key of length l, where $t_\lambda = t_\lambda(n)$ depends on n.
For instance, Franceschini and Grossi's DS [7] with word-packing supports $t_\lambda = O(\lg n)$.
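
The naming invariant can be sketched as follows. Note the deliberate deviation from the source: a Python `dict` is a hash table, whereas the text assumes a deterministic dictionary, so this only illustrates the look-up-before-create logic and not the stated time bounds. The class name `NameTable` and its attributes are hypothetical.

```python
class NameTable:
    """Assigns exactly one name to every distinct right-hand side."""

    def __init__(self):
        self.rules = []      # dictionary D: name -> tuple of children
        self.reverse = {}    # reverse dictionary: tuple of children -> name

    def name_of(self, children):
        key = tuple(children)
        name = self.reverse.get(key)
        if name is None:              # no rule for this right-hand side yet
            name = len(self.rules)
            self.rules.append(key)
            self.reverse[key] = name
        return name
```

Because `name_of` never creates a second name for the same right-hand side, two nodes receive equal names exactly when their children have equal names, which by induction means that their subtrees are isomorphic.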


Lemma 3.2. An ESP tree of a string of length n can be built in $O(n (\lg^* n + t_\lambda))$ time. It consumes O(n)
space.


Proof. A name is inserted or looked-up in $t_\lambda$ time. Due to the alphabet reduction technique, applying esp
on a substring of length l takes $O(l \lg^* n)$ time, returning a sequence of blocks of length at most l/2. □


3.4 LCE queries on ESP trees 

ESP trees are fairly stable against edit operations: The number of nodes that are differently parsed after
prepending or appending a string to the input is upper bounded by $O(\lg n \lg^* n)$ [SJ Lemma II]. To use this
property in our context of LCE queries, we consider nodes of ET(Y) that are still present in ET(XYZ); a
node v in ET(Y) generating $Y[i_0..j_0]$ is said to be stable iff, for all strings X and Z, there exists a node v' in
ET(XYZ) that has the same name as v and generates $(XYZ)[|X| + i_0 .. |X| + j_0]$. We also consider repeating
nodes that are present with slight shifts; a non-stable repeating node v in ET(Y) generating $Y[i_0..j_0]$ is said
to be semi-stable iff, for all strings X and Z, there exists a node v' in ET(XYZ) that has the same name
as v and generates a substring intersecting with $(XYZ)[|X| + i_0 .. |X| + j_0]$. Then, the proof of Lemma 9 of [5]
says that, for each height, ET(Y) contains $O(\lg^* n)$ nodes that are not (semi-)stable, which we call fragile.
Since the children of the (semi-)stable nodes are also (semi-)stable, there is a border on ET(Y) separating
(semi-)stable nodes and fragile nodes.

In order to use semi-stable nodes to answer LCE queries efficiently, we let each node have an additional
property, called surname. A node $v := \langle Y \rangle_h[i]$ is said to be repetitive iff there exists some
height h' < h with $Y_{h'}[I] = d^{|I|}$, where $\langle Y \rangle_{h'}[I]$ is the sequence of nodes on height h' in the subtree rooted
at $\langle Y \rangle_h[i]$ and $d \in \Sigma_{h'}$. The surname of a repetitive node $v := \langle Y \rangle_h[i]$ is the name of a highest non-repetitive
node in the subtree rooted at v. The surname of a non-repetitive node is the name of the node itself. It is
easy to compute and store the surnames while constructing ETs.

The connection between semi-stable nodes and the surnames is based on the fact that a semi-stable node
is repetitive: Let u be the node whose name is the surname of a semi-stable node v. If u is on height h, v's
subtree consists of a repeat of u's on height h. A shift of v can only be caused by adding u's to the subtree
of v. So the shift is always a multiple of the length of the generated substring of u.

Lemma 3.3. Let X and Y be strings with $|X| \le |Y| \le n$. Given ET(X) and ET(Y) built with the same
dictionary and two text-positions $1 \le i_X \le |X|$, $1 \le i_Y \le |Y|$, we can compute $l := \mathrm{lcp}(X[i_X..], Y[i_Y..])$ in
$O(\lg |Y| + \lg l \lg^* n)$ time.

Proof. We compute the longest common prefix Z of $X[i_X..]$ and $Y[i_Y..]$ by traversing both ESP trees
simultaneously from their roots. During the traversal, we trace the border that separates (semi-)stable nodes
and fragile nodes of Z. Given two stable nodes, we can match their generated substrings by their names
in constant time. Matching a semi-stable node with a (semi-)stable node can be done in constant time
due to the surnames: Assume we visit two nodes v and v' (each belonging to one tree), where v is
semi-stable, v' is (semi-)stable, and both have the same name, but the compared positions are shifted. Since v
and v' are repetitive with the same surname, the match with shift can be done in constant time by using
their surname. Since the number of visited fragile nodes is bounded by $O(\lg l \lg^* n)$, Z can be computed in
$O(\lg |Y| + \lg l \lg^* n)$ time, where $O(\lg |Y|)$ time is needed to traverse both trees from their roots to the first
stable node. □

3.5 Truncated ETs 

Building an ET over a string Y requires O(|Y|) words of space, which might be too much in some scenarios.
Our idea is to truncate the ET at some fixed height, discarding the nodes in the lower part. The truncated
version stores just the upper part, while its (new) leaves refer to (possibly long) substrings of Y. The
resulting tree is called the truncated ET (tET). More precisely, we define a height $\eta$ and delete all nodes
at height less than $\eta$, which we call lower nodes. A node higher than $\eta$ is called an upper node. The
nodes at height $\eta$ form the new leaves and are called $\eta$-nodes. Similar to the former leaves, their names
are pointers to their generated substrings appearing in Y. Remembering that each internal node has two or
three children, an $\eta$-node generates a string of length at least $2^\eta$ and at most $3^\eta$. So the maximum number
of nodes in a tET of a string of length n is $O(n/2^\eta)$.

Similar to leaves, we use the generated substring X of an $\eta$-node v for storing and looking up v. It can
be looked up or inserted in $O(|X| / \lg_\sigma n + t_\lambda)$ time.
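
The bound on the number of nodes of a tET can be checked with a short calculation. It is sketched here under the observation that every node at height $h \ge \eta$ generates at least $2^h$ characters, and that the generated substrings of the nodes on one height are disjoint:

```latex
\[
\#\{\text{nodes of the tET}\}
\;\le\; \sum_{h \ge \eta} \frac{n}{2^{h}}
\;=\; \frac{n}{2^{\eta}} \sum_{i \ge 0} 2^{-i}
\;=\; \frac{2n}{2^{\eta}}
\;=\; O\!\left(\frac{n}{2^{\eta}}\right).
\]
```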

These observations lead us to 

Lemma 3.4. We can build a tET of a string Y of length n in $O(n (\lg^* n + \eta / \lg_\sigma n + t_\lambda / 2^\eta))$ time, using
$O(3^\eta \lg^* n)$ words of working space. The tree consumes $O(n/2^\eta)$ space.

Proof. Instead of building the ESP tree level by level, we compute the $\eta$-nodes node by node, from left to
right. We can split an ESP parsing of the whole string up into parts. When a new part starts, we read $\Delta_L$
characters of the end of the old part such that the parsing starts with $\Delta_L$ old characters. These characters
are necessary to reconstruct the meta-block boundaries, and for the alphabet reduction to produce the same
results as for the whole string. In our case, a part contains the generated substring of one $\eta$-node. Since
an $\eta$-node generates a substring of at most $3^\eta$ characters, we parse $3^\eta + \Delta_L$ characters on text level at once,
creating lower nodes. In order to parse a string of lower nodes by ESP, we have to give them names.

The names of the lower nodes are created temporarily, and are not stored in the dictionary. Since two
$\eta$-nodes have the same name iff their subtrees are isomorphic, the task is to create the name of a lower node
based on its subtree, and restore its name without using D. To this end, we use the generated substring of
a lower node as its name.

Working Space. We need $O(3^\eta \lg^* n)$ words of working space in order to construct an $\eta$-node v. The
name of v is determined by its subtree and its local surrounding. So we can construct v after computing its
subtree and its local surrounding. Both contain lower nodes that we store temporarily in the working space.
With a pointer based representation, the subtree of an $\eta$-node needs $O(3^\eta)$ words of working space. Since
we additionally store its local surrounding, we come to $O(3^\eta \lg^* n)$ words of working space.

Time. The time bound $O(n \lg^* n)$ for the repeated application of the alphabet reduction is the same as
in Lemma 3.2.

While parsing a string of lower nodes, the ESP compares the names of two adjacent lower nodes.
Comparing two lower nodes is done by naively comparing the characters represented by their names. We compare
two lower nodes during the construction of an $\eta$-node. Let us take the set of lower nodes on a height $1 \le h < \eta$.
Their generated substrings have a total length of n. So we spend $O(n / \lg_\sigma n)$ time in total for comparing
lower nodes on the same height h. By summing over all heights, these comparisons take $O(n \eta / \lg_\sigma n)$
time in total.

By the same argument, maintaining the names of all $\eta$-nodes takes $O(n / \lg_\sigma n + t_\lambda n / 2^\eta)$ time.

A name is looked-up in $O(t_\lambda)$ time for an upper node. Since the number of upper nodes is bounded
by $n/2^\eta$, maintaining the names of the upper nodes takes $O(t_\lambda n / 2^\eta)$ time. This time is subsumed by the
lookup time for the $\eta$-nodes. □

Lemma 3.5. Let X and Y be strings with $|X|, |Y| \le n$. Given tET(X) and tET(Y) built with the same
dictionary and two text-positions $1 \le i_X \le |X|$, $1 \le i_Y \le |Y|$, we can compute $\mathrm{lcp}(X[i_X..], Y[i_Y..])$ in
$O(\lg^* n \, (\lg(n/2^\eta) + 3^\eta / \lg_\sigma n))$ time.

Proof. Lemma 3.3 gives us the time bounds when dealing with an ET. According to the lemma, there are at
most $O(\Delta_L + \Delta_R)$ many comparisons that examine the leaves of some $\eta$-nodes. Unfortunately, we cannot
perform any node comparison on a height lower than $\eta$ on the truncated trees; instead we take the name
of each respective $\eta$-node leading us to a substring whose length is upper-bounded by $3^\eta$. Comparing both
$\eta$-nodes is done by checking at most $3^\eta / \lg_\sigma n$ words. Since the height of the tET is bounded by $O(\lg(n/2^\eta))$,
we use up to $O(\lg^* n \lg(n/2^\eta))$ time for the upper nodes. □


With $\tau := 2^\eta$ we get Theorem 1.3.
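
The substitution can be spelled out as follows. The sketch assumes the dictionary of Franceschini and Grossi mentioned in Section 3.3, i.e., $t_\lambda = O(\lg n)$, and writes $\tau$ for the trade-off parameter of Theorem 1.3.

```latex
\begin{align*}
\tau = 2^{\eta} &\;\Longrightarrow\; \eta = \lg \tau,
  \qquad 3^{\eta} = (2^{\eta})^{\lg 3} = \tau^{\lg 3},
  \qquad \frac{n}{2^{\eta}} = \frac{n}{\tau}, \\
\text{space:}\quad & O(n / 2^{\eta}) = O(n / \tau), \\
\text{query (Lemma 3.5):}\quad & O\bigl(\lg^* n \, (\lg(n/2^{\eta}) + 3^{\eta} / \lg_\sigma n)\bigr)
  = O\bigl(\lg^* n \, (\lg(n/\tau) + \tau^{\lg 3} / \lg_\sigma n)\bigr), \\
\text{construction (Lemma 3.4):}\quad & O\bigl(n \, (\lg^* n + \eta / \lg_\sigma n + t_\lambda / 2^{\eta})\bigr)
  = O\bigl(n \, (\lg^* n + (\lg \tau) / \lg_\sigma n + (\lg n) / \tau)\bigr).
\end{align*}
```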


4 Sparse Suffix Sorting 

The sparse suffix sorting problem asks for the order of suffixes starting at certain positions in a text T. In 
our case, these positions can be given online, i.e., sequentially and in an arbitrary order. We collect them 
conceptually in a dynamic set V. Due to the online setting, we represent the order of the suffixes Suf(V)
starting at those positions by a dynamic, self-balancing binary search tree (e.g., an AVL tree). Each node
of the tree is associated with a distinct suffix in Suf(V), and the lexicographic order is used as the sorting
criterion. 

Borrowing the technique of Irving and Love 0 , an AVL tree on a set of strings S can be augmented with
LCP values so that we can compute $l := \max\{\mathrm{lcp}(X, Y) \mid X \in S\}$ for a string Y in $O(l / \lg_\sigma n + \lg |S|)$ time.
Inserting a new string into the tree is supported in the same time complexity. Irving and Love [5] called this
data structure the suffix AVL tree on S; we denote it by SAVL(S).

Given a text T of length n, we will use SAVL(Suf(V)) as a representation for SSA(T, V) and SLCP(T, V).
Our goal is to build SAVL(Suf(V)) efficiently. However, inserting suffixes naively suffers from the lower
bound $\Omega(n |V| / \lg_\sigma n)$ on time. How to speed up the comparisons by exploiting a data structure for LCE
queries is the topic of this section.

4.1 Abstract Algorithm 

Starting with an empty set of positions $V = \emptyset$, our algorithm updates SAVL(Suf(V)) on the input of every new
text-position, involving LCE computation between the new suffix and some suffixes stored in SAVL(Suf(V)).
Our idea is that creating a mergeable LCE data structure on the read substrings may be helpful for later 
queries. In more detail, we need a data structure that 

• answers LCE queries on two substrings covered by instances of this data structure, 

• is mergeable in such a way that the merged instance answers queries faster than performing a query 
over both former instances separately. 

We call this abstract data type dynamic LCE (dynLCE); it supports the following operations: 

• dynLCE(y) constructs a dynLCE M on a substring Y of T. Let M.text denote the string Y on which 
M is constructed. 

• $\mathrm{LCE}(M_1, M_2, p_1, p_2)$ computes $\mathrm{lcp}(M_1.\mathrm{text}[p_1..], M_2.\mathrm{text}[p_2..])$, where $p_i \in [1..|M_i.\mathrm{text}|]$ for i = 1, 2.

• $\mathrm{merge}(M_1, M_2)$ merges two dynLCEs $M_1$ and $M_2$ such that the output is a dynLCE on the
concatenation of $M_1.\mathrm{text}$ and $M_2.\mathrm{text}$.

We use the expression $t_C(|Y|)$ to denote the construction time on a string Y. Further, $t_L(|X| + |Y|)$ and
$t_M(|X| + |Y|)$ denote the LCE query time and the time for merging on two strings X and Y, respectively.
Querying a dynLCE built on a string of length $\ell$ is faster than the word-packed character comparison
iff $\ell = \Omega(t_L(\ell) \lg n / \lg \sigma)$. Hence, there is no point in building a dynLCE on a text smaller than
$g := \Theta(t_L(g) \lg n / \lg \sigma)$.
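
The dynLCE interface described above can be made concrete with a placeholder implementation. The class name `NaiveDynLCE` is hypothetical, and its operations simply scan or concatenate the stored text, so it shows only the shape of the abstract data type (construction, LCE, merge) and has none of the time bounds the section aims for.

```python
class NaiveDynLCE:
    """Placeholder dynLCE: stores the text and answers queries by scanning."""

    def __init__(self, text):
        self.text = text

    @staticmethod
    def lce(m1, m2, p1, p2):
        """lcp(m1.text[p1..], m2.text[p2..]) for 1-based p1 and p2."""
        x, y = m1.text, m2.text
        length = 0
        while (p1 + length <= len(x) and p2 + length <= len(y)
               and x[p1 + length - 1] == y[p2 + length - 1]):
            length += 1
        return length

    @staticmethod
    def merge(m1, m2):
        """Return a dynLCE on the concatenation of both texts."""
        return NaiveDynLCE(m1.text + m2.text)
```

Any structure offering the same three operations, such as the ESP trees of Section 3, can be substituted for this placeholder.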

We store the text intervals covered by the dynLCEs such that we know the text-positions where querying 
a dynLCE is possible. Such an interval is called an LCE interval. An LCE interval I stores a pointer to 
its dynLCE data structure M, and an integer i such that M.text[i..i + |I| − 1] = T[I]. The LCE intervals
themselves are maintained in a self-balancing binary search tree of size O(|V|), storing their starting positions
as keys.

For a new position $1 \le p \le |T|$, $p \notin V$, updating SAVL(Suf(V)) to SAVL(Suf(V ∪ {p})) involves two
parts: first locating the insertion node for p in SAVL(Suf(V)), and then updating the set of LCE intervals.

Locating. The suffix AVL tree performs an LCE computation for each node encountered while
locating the insertion point of p. Assume that the task is to compare the suffixes T[i..] and T[j..] for some
$1 \le i, j \le |T|$. First check whether the positions i and j are contained in an LCE interval, in O(lg m)
time. If both positions are covered by LCE intervals, then query the respective dynLCEs. Otherwise, look
up the position where the next LCE interval starts. Up to that position, naively compare both substrings.
Finally, repeat the above check again at the new positions, until finding a mismatch. After locating the
insertion point of p in SAVL(Suf(V)), we obtain $\bar{p} := \mathrm{mlcparg}_p$ and $l := \mathrm{mlcp}_p$ as a byproduct, where
$\mathrm{mlcparg}_p := \arg\max_{p' \in V, p' \ne p} \mathrm{lcp}(T[p..], T[p'..])$ and $\mathrm{mlcp}_p := \mathrm{lcp}(T[p..], T[\mathrm{mlcparg}_p..])$ for $1 \le p \le |T|$.
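
The alternation between dynLCE queries and naive comparison can be sketched as follows. Several simplifications are assumptions of this sketch and not part of the source: LCE intervals are plain (start, end) pairs in a sorted list searched with `bisect` instead of a balanced search tree, and the dynLCE query is passed in as the callable `dyn_query`, which defaults to plain scanning. The names `scan` and `lce_via_intervals` are hypothetical.

```python
import bisect


def scan(text, i, j, limit):
    """Compare T[i..] and T[j..] (1-based) for at most `limit` characters."""
    n, l = len(text), 0
    while (l < limit and i + l <= n and j + l <= n
           and text[i + l - 1] == text[j + l - 1]):
        l += 1
    return l


def lce_via_intervals(text, intervals, i, j, dyn_query=scan):
    """LCE of T[i..] and T[j..]; `intervals` is a sorted list of disjoint
    1-based inclusive (start, end) pairs playing the role of LCE intervals."""
    n = len(text)
    starts = [s for s, _ in intervals]

    def covering(p):
        k = bisect.bisect_right(starts, p) - 1
        return intervals[k] if k >= 0 and intervals[k][1] >= p else None

    def next_start(p):
        k = bisect.bisect_right(starts, p)
        return starts[k] if k < len(starts) else n + 1

    total = 0
    while i <= n and j <= n:
        ci, cj = covering(i), covering(j)
        if ci and cj:
            # Both covered: one dynLCE query up to the nearer interval end.
            span = min(ci[1] - i, cj[1] - j) + 1
            l = dyn_query(text, i, j, span)
        else:
            # Scan naively until an uncovered position reaches an interval.
            span = min(next_start(p) - p
                       for p, c in ((i, ci), (j, cj)) if c is None)
            l = scan(text, i, j, span)
        total, i, j = total + l, i + l, j + l
        if l < span:        # mismatch or end of text found
            break
    return total
```

The result always equals the plain LCE; the intervals only decide which of the two comparison techniques handles each stretch of the text.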

Updating. The LCE intervals are updated dynamically, subject to the following constraints (see
Figure 4):

Constraint 1: The length of each LCE interval is at least g.

Constraint 2: For every $p \in V$ the interval $[p..p + \mathrm{mlcp}_p - 1]$ is covered by an LCE interval except at most g
positions at its left and right ends.

Constraint 3: There is a gap of at least g positions between every pair of LCE intervals.

These constraints guarantee that there is at most one LCE interval that intersects with $[p..p + \mathrm{mlcp}_p - 1]$
for a $p \in V$.


Figure 4: Sketch of two LCE intervals and the constraints.


Figure 5: The interval $I := [p + i..p + j]$ is not yet covered by an LCE interval, but belongs to $[p..p + l - 1]$.
It is accompanied by $J := [\bar{p} + i..\bar{p} + j]$.

The following instructions will satisfy the constraints: If l < g, we do nothing. Otherwise, we have to care
about Constraint 2. Fortunately, there is at most one position in V that possibly invalidates Constraint 2
after adding p, and this is $\bar{p}$; otherwise, by transitivity, we would have created some larger LCE interval
previously. Let $U \subseteq [1..n]$ be the positions that belong to an LCE interval. The set $[p..p + l - 1] \setminus U$ can be
represented as a set of disjoint intervals of maximal length. For each interval $I := [p + i..p + j] \subseteq [p..p + l - 1]$
of that set (for some $0 \le i \le j < l$, see Figure 5), apply the following rules with $J := [\bar{p} + i..\bar{p} + j]$ sequentially:

Rule 1: If J is a sub-interval of an LCE interval, then declare I as an LCE interval and let it refer to the
dynLCE of the larger LCE interval.

Rule 2: If J intersects with an LCE interval K, enlarge K to $K \cup J$, enlarging its corresponding dynLCE
(we can enlarge a dynLCE by creating a new instance and merging both instances). Apply Rule 1.

Rule 3: Otherwise, create a dynLCE on I, and make I an LCE interval.

Rule 4: If Constraint 3 is violated, then a newly created or enlarged LCE interval is adjacent to another
LCE interval. Merge those LCE intervals and their dynLCEs.


We also need to satisfy Constraint 2 on $[\bar{p}..\bar{p} + l - 1]$. To this end, update U, compute the set of disjoint
intervals $[\bar{p}..\bar{p} + l - 1] \setminus U$ and apply the same rules on it.

Although we might create some LCE intervals covering less than g characters, we will restore Constraint 1
by merging them with a larger LCE interval in Rule 4. In fact, we introduce at most two new LCE intervals.
Constraint 1 is easily maintained, since we will never shrink an LCE interval.


Lemma 4.1. Given a text T of length n that is loaded into RAM, the SSA and SLCP of T for a set of m arbitrary positions can be computed deterministically in $O(t_C(|C|) + t_L(|C|)\, m \lg m + m\, t_M(|C|))$ time.


Proof. The analysis is split into managing the dynLCEs, and the LCE queries:

• We build dynLCEs over at most $|C|$ characters of the text. So we need at most $t_C(|C|)$ time for constructing all dynLCEs. During the construction of the dynLCEs we spend at most $O(|C| / \log_\sigma n) = O(t_C(|C|))$ time on naive searches.

• The number of merge operations on the LCE intervals is upper bounded by $4m$ in total, since we create at most two new LCE intervals for every position in V. So we spend at most $4m\, t_M(|C|)$ time for the merging.

• LCE queries involve either naive character comparisons or querying a dynLCE. We change the comparison technique (LCE by dynLCE, or naive word-comparison) at most $2m$ times, until finding a mismatch. For the latter, the overall time is bounded by $O(t_C(|C|) + t_L(|C|)\, m \lg m)$:

  - Since we do not create an LCE interval on two substrings with an LCP-value smaller than $g$ (Constraint 1), we spend at most $O(g m \lg m / \log_\sigma n) = O(t_L(|C|)\, m \lg m)$ time on those substrings.

  - Otherwise, the time is already subsumed by the total time for dynLCE-creation.

  For the former, the LCE queries take at most $O(t_L(|C|)\, m \lg m)$ overall time.

  Looking up an LCE interval is done in $O(\lg m)$ time. For each LCE query, we look up at most $2m$ LCE intervals, summing to $O(m \lg m)$ time, which is subsumed by the time bound for LCE queries.

□
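
The two ingredients of an LCE query used in the proof, a logarithmic-time lookup of the LCE interval containing a position and a naive comparison as fallback, can be illustrated as below. This is a minimal sketch under assumptions: sorted Python lists with `bisect` stand in for the (dynamic) search tree of the paper, the names `find_lce_interval` and `naive_lce` are hypothetical, and the naive routine compares single characters rather than packed words.

```python
from bisect import bisect_right


def find_lce_interval(starts, ends, pos):
    """Index of the LCE interval containing pos, or None.

    starts/ends: begin and end positions of disjoint inclusive intervals,
    sorted by begin. The binary search takes O(lg m) time for m intervals.
    """
    k = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    if k >= 0 and pos <= ends[k]:
        return k
    return None


def naive_lce(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:] (0-based)."""
    n = len(text)
    length = 0
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length
```

A query would alternate between the two: use the dynLCE while both positions lie in LCE intervals, and fall back to the naive comparison otherwise.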


4.2 Sparse Suffix Sorting with ESP Trees

We will show that the ESP tree is a suitable data structure for dynLCE. In order to merge two ESP trees, we use a common dictionary D that is stored globally. Fortunately, it is easy to combine two ETs by updating just a handful of nodes, which are fragile.

Here we explain which nodes in ET(Y) are fragile. Whether a node is fragile or not is determined bottom up and depends on the type of its meta-block $\mu$.

If $\mu$ is type 2. Since a node is parsed based on its local surrounding, a node is fragile iff it is not surrounded or its local surrounding contains a fragile node.

If $\mu$ is a repeating meta-block. A node $v$ of a repeating meta-block is determined, among others, by its children and its three left siblings: If one of $v$'s children is fragile, $v$ is fragile, too. In addition, if one of $v$'s three left siblings is fragile, the type of the meta-block to which $v$ belongs can switch from type 2 to type M and vice versa (see Figure 6). The type switch may change the contents of $v$.

Moreover, if $\mu$ contains a fragile node, or one of the three right-most nodes in the meta-block preceding $\mu$ is fragile, we treat the last two nodes of $\mu$ as fragile: Since the greedy blocking partitions a meta-block from left to right, the last two nodes always absorb the remaining characters (see Figure 7).
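
To see why only the last two nodes of a repeating meta-block absorb the remainder, consider the following sketch of the greedy blocking. It assumes the usual ESP convention that a run of $k \ge 2$ equal characters is cut from left to right into blocks of length three, with the remainder handled by one or two trailing blocks of length two; the function name `greedy_blocks` is hypothetical.

```python
def greedy_blocks(k):
    """Block lengths for a run of k >= 2 equal characters.

    Assumed rule: take blocks of length 3 from left to right; a remainder of
    k mod 3 == 1 yields two final blocks of length 2, a remainder of
    k mod 3 == 2 yields one final block of length 2.
    """
    if k < 2:
        raise ValueError("a repeating meta-block has at least two characters")
    blocks = []
    remaining = k
    while remaining > 4:
        blocks.append(3)
        remaining -= 3
    if remaining == 4:
        blocks.extend([2, 2])
    else:
        blocks.append(remaining)  # remaining is 2 or 3 here
    return blocks
```

For instance, `greedy_blocks(7)` and `greedy_blocks(8)` share their leading block and differ only in the last two blocks, which is the reason these two nodes are treated as fragile.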

Lemma 4.2. Given two strings $X, Y \in \Sigma^*$, assume that we have already created ET(X) and ET(Y). Merging both trees into ET(XY) takes $O(t_\lambda (\Delta_L \lg |Y| + \Delta_R \lg |X|))$ time.

Proof. We recompute some nodes located at the splice of both trees, from the bottom up to the root: At each height $h$, check the $\Delta_R$ rightmost nodes of $\langle X \rangle_h$, and some leftmost nodes of $\langle Y \rangle_h$ until finding a stable node in $\langle Y \rangle_h$. If the leftmost $\Delta_L$ nodes of $\langle Y \rangle_h$ are type 2 nodes, recompute the ESP for these nodes. After processing these $\Delta_L$ nodes we encounter a stable node, and stop. Otherwise, we have to fix a repeating meta-block $\mu$ (like in Figure 8). We restructure $\mu$ with the following operations: Go to $\mu$'s right end in $O(\lg |\mu|)$ time, involving tree climbing and node skipping based on the subtree sizes and the names. Then reparse the fragile nodes. Since the fragile property is propagated upwards, we recompute its ancestors and their fragile sibling nodes. We mark the recomputed fragile nodes such that we will not recompute them again. By this strategy every solid type 2 node gets marked before we visit it during the re-computation of the type 2 nodes in the first case. Since there are $O(\Delta_L \lg |Y|)$ fragile nodes, we spend $O(\Delta_L t_\lambda \lg |Y|)$ time for the fragile nodes in total. Finally, climb up one height on both trees. □

We conclude that the ETs are a representation of dynLCE. The following theorem combines the results of Lemmas 4.1 and 4.2.

Figure 6: We prepend the string aab to the text $a^k$ character by character, and compute the parsing each time. The last row shows an example, where a former type 1 meta-block changes to type M, although it is right of a type 2 meta-block. Here, $k \bmod 3 = 2$.


Theorem 4.3. Given a text T of length n and a set of m text positions V, SSA(T, V) and SLCP(T, V) can be computed in $O(|C| (\lg^* n + t_\lambda) + m \lg m \lg n \lg^* n)$ time.


Proof. We have $t_L(|C|) = O(\lg^* n \lg n)$ due to Lemma 3.3, $g = \Theta(\lg^* n \lg^2 n / \lg \sigma)$, $t_C(|C|) = O(|C| (\lg^* n + t_\lambda))$ due to the construction lemma for ETs, and $t_M(|C|) = O(t_\lambda \lg n \lg^* n)$ due to Lemma 4.2. Actually, the cost for merging is already upper bounded by the tree creation: Let $\delta \le m$ be the number of LCE intervals. Since each ET covers at least $g$ characters, $\delta g = O(|C|)$ holds, and we get $\delta\, t_M(|C|) \le |C|\, t_M(|C|)/g = O(|C|\, t_\lambda)$ overall time for merging.

By applying these results to Lemma 4.1 we get the claimed time bounds. □


5 Hierarchical Stable Parsing 

Remembering the outline in the introduction, the key idea is to solve the limited space problem by storing dynLCEs in text space. Taking two LCE intervals on the text containing the same substring, we overwrite one part while marking the other part as a reference. By choosing a suitably large $\eta$, we can overwrite the text of one LCE interval with a tET whose $\eta$-nodes refer to substrings of the other LCE interval. Merging two tETs involves a reparsing of some $\eta$-nodes (cf. Figure [?] and Figure [?]). Assume that we want to reparse an $\eta$-node $v$, and that its generated substring gets enlarged due to the parsing. We have to locate a substring on the text that contains its new generated substring $X$. Although we can create a suitably large string containing $X$ by concatenating the generated substrings of its preceding and succeeding siblings, these $\eta$-nodes may point to text intervals that may not be consecutive. Since the name of an $\eta$-node is the representation of a single substring, we have to search for a substring equal to $X$ in the text. Because this would be too inefficient, we will show a slight modification of the ESP technique that circumvents this problem.

Figure 7: The greedy blocking is related to the Euclidean division by three. The remainder $k \bmod 3$ is determined by the number of characters in the last two blocks (here, $k \bmod 3 = 0$). In this example, ESP created a single, repeating meta-block on each input.

Figure 8: We merge the ET of a string of the form $(ab)^x a^y$ with the ET of a run $a^z$ (both at the top) to the ET of their concatenation $(ab)^x a^{y+z}$ (bottom tree). Reparsing a repeating meta-block of the right tree changes its fragile nodes.

Figure 9: We take Y from Figure [?] and prepend the character a to it. Parsing aY with ESP generates a tree that is very different to Figure [?].

5.1 Hierarchical Stable Parse Trees

Our modification, which we call hierarchical stable parse trees or HSP trees, affects only the definition of meta-blocks. The factorization of meta-blocks is done by relaxing the check whether two characters are equal; instead of comparing names we compare by surname. (The check is relaxed since nodes with different surnames cannot have the same name.) This means that we allow meta-blocks of type 1 to contain heterogeneous elements as long as they share the same surname (cf. Figure 10). The other parts of the algorithm are left untouched; in particular, the alphabet reduction uses the names as before. We write HT(Y) for the resulting parse tree when HSP is applied to a string Y.

It follows directly that

(a) the generated substring of a repetitive node is a repetition.

(b) consecutive, repetitive nodes with the same surname are grouped into one meta-block $\mu$. The generated substring of each node in $\mu$ is a repetition with the same root, but with possibly different exponents. The exponents of the generated substrings of the last two nodes cannot be larger than the exponents of the other generated substrings.

(c) A node $v$ of a repeating meta-block $\mu$ is non-repetitive iff $\mu$ is type M and $v$ contains the single character that got merged with a former repetitive meta-block. The node $v$ can only be located at the beginning or end of $\mu$. If $\mu$ is the leftmost or rightmost meta-block, this node cannot be surrounded (see Figure [?]). By (b), $v$ is either stable or non-surrounded.

Lemma 5.1. An HSP tree on an interval of length $\ell$ can be built in $O(\ell (\lg^* n + t_\lambda))$ time. It consumes $O(\ell)$ space.


Figure 10: The HSP creates nodes based on the surnames. Repetitive nodes are labeled with their name, followed by their surname in parenthesis.

Figure 11: We take the text Y of Figure [?] and build HT(Y) (left) and HT(aY) (right) like in Figure [?]. The HSP does not change any edges of HT(Y), thanks to the usage of surnames.


The motivation of our modification will become apparent when examining the fragile nodes that are the last two nodes on a repeating meta-block. Although they could still change their names when prepending characters to the text, their surnames do not get changed by the parsing: Focus on height $h$ in HT(Y), and look at two meta-blocks $\langle Y \rangle_h[\nu]$ and $\langle Y \rangle_h[\mu]$ with $e(\nu) < b(\mu)$. Assume that $\nu$ is a type 1 meta-block, and that $\mu$ is surrounded. The nodes of $\nu$ are grouped in meta-blocks by surnames, and the surnames of the fragile nodes belonging to repeating meta-blocks cannot change by prepending some string to the input. So nodes contained in $\mu$ are stable (Figure 12). Due to the way type M meta-blocks are created, the same is true when $\mu$ is a type M meta-block (compare Figure 13 with Figure 14). Hence, the modification prevents a surrounded node from being changed severely, which is formalized as:

Figure 12: Assume that $a^k (ba)^i$ is a prefix of $\langle Y \rangle_h$ on some height $h$. The parsing creates a repeating meta-block consisting of the characters $a^k$, and a type 2 meta-block containing the characters $(ba)^i$. For $k \ge 2$ it is impossible to modify the latter meta-block by prepending characters (bottom figure), since the parsing always groups adjacent nodes with the same surname into one repeating meta-block.


Lemma 5.2. If a surrounded node is neither stable nor semi-stable, it can only be changed to a node whose 
generated substring is a prefix of the generated substring of an already existing node. 

Proof. There are two (non-exclusive) properties for a node to be fragile and surrounded: It belongs to the last two nodes built on a repeating meta-block, or its subtree contains a fragile surrounded node. Let $v$ be one of the lowest surrounded fragile nodes. Since $v$ cannot contain any fragile surrounded node, it is one of the last two nodes built on a repeating meta-block $\langle Y \rangle_h[\mu]$. Moreover, $\langle Y \rangle_h[b(\mu)-3..b(\mu)-1]$ contains a fragile non-surrounded node. But since $v$ is surrounded, the condition $|\mu| > \Delta_L > 8$ (for $n > 4$) holds; so there is a repetitive node $u$ consisting of three nodes in $\mu$. Any node with the same surname (as $u$ or $v$) generates a substring that is a prefix of the generated substring of $u$. The parsing of HSP trees assures that any surrounded node located to the right of $\langle Y \rangle_h[\mu]$ is stable. This situation carries over to higher layers until the number of nodes with the same surname shrinks below 8, at which point fragile nodes containing $v$ in their subtrees are not surrounded anymore. Therefore, the proof is done by recursively applying this analysis. □


5.2 Sparse Suffix Sorting in Text Space

The truncated HT (tHT) is the truncated version of the HT. It is defined analogously to the tET (see Section 3.5), with the exception of the surnames: For each repetitive node, we mark whether its surname is the name of an upper node, of an $\eta$-node, or of a lower node. Therefore, we need to save the names of certain lower nodes in the reverse dictionary of D. This is only necessary when an upper node or an $\eta$-node $v$ has a surname that is the name of a lower node. If $v$ is an upper node having a surname equal to the name of a lower node, the $\eta$-nodes in the subtree rooted at $v$ have the same surname, too. So the number of lower node entries in the reverse dictionary is upper bounded by the number of $\eta$-nodes, and each lower node generates a substring of length less than $3^\eta$. We conclude that the results of Lemma 3.4 and Lemma 3.5 apply to the tHT, too.

Figure 13: A type M meta-block is created by fusing a single character with its sibling meta-block. We choose the tie breaking rule to fuse the character with its right meta-block. In order to see why this rule is advantageous, we temporarily modify the parsing to choose the left meta-block, if applicable. Let us examine the trees of a string Y consisting of a run of a's, a single b, another run of a's and a power of ba (top), and of aY (bottom). In both trees, the two rightmost blocks of the type M meta-block on the bottom level are children of the leftmost node of the right meta-block on the next level. Prepending a character to Y may change those bottom level nodes and therefore change the right meta-block on height 2.


For proving Theorem 1.1 it remains unclear how the tHTs are stored in text space, and how much the running time for an LCE query gets affected. Considering the construction in text space, it is nontrivial to give an efficient solution for Rule 3 and Rule 4 in Section 4.1. In the following, we will fix $\eta$ in Lemma 5.3, and develop Corollary 5.4 to deal with Rule 3, Corollary 5.5 for the LCE queries, and Corollary 5.6 for Rule 4.

Assume that we want to store tHT(T[I]) on some text interval $I$. Since tHT(T[I]) could contain nodes with $|I|$ distinct names, it requires $O(|I|)$ words, i.e., $O(|I| \lg n)$ bits of space that do not fit in the $|I| \lg \sigma$ bits of $T[I]$. Taking some constant $\alpha$ (independent of $n$ and $\sigma$, but dependent on the size of a single node), we can solve this space issue by setting $\eta := \log_3 (\alpha \lg^2 n / \lg \sigma)$:

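Lemma 5.3. Let $\eta = \log_3 (\alpha \lg^2 n / \lg \sigma)$. Then the number of nodes is bounded by $O(\ell (\lg \sigma)^{0.7} / (\lg n)^{1.2})$. An $\eta$-node generates a substring containing at most $\lceil \alpha (\lg n)^2 / \lg \sigma \rceil$ characters.

Proof. The generated substring of an $\eta$-node is at least $2^\eta$ long, and takes at least

\[ 2^\eta \lg \sigma = 3^\eta (2/3)^\eta \lg \sigma = \alpha \left( \alpha \lg^2 n / \lg \sigma \right)^{\log_3 2 - 1} \lg^2 n = \alpha^{\log_3 2} (\lg n)^{2 \log_3 2} (\lg \sigma)^{1 - \log_3 2} \ge \alpha^{0.6} (\lg n)^{1.2} (\lg \sigma)^{0.3} \]

bits, where we used that $\eta = \log_3 (\alpha \lg^2 n / \lg \sigma) = (\log_3 2 - 1) \log_{2/3} (\alpha \lg^2 n / \lg \sigma)$. So the number of nodes is bounded by $\ell / 2^\eta \le \ell \lg \sigma / (\alpha^{0.6} (\lg n)^{1.2} (\lg \sigma)^{0.3}) = O(\ell (\lg \sigma)^{0.7} / (\lg n)^{1.2})$. □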
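
To get a feeling for the magnitudes in Lemma 5.3, the quantities can be evaluated numerically. The helper below is only an illustrative sketch: the function name `eta_parameters` and the default `alpha = 1.0` are assumptions made here (the paper fixes $\alpha$ depending on the node size), and the final inequality is checked under the assumption $\alpha \ge 1$ and $\sigma \ge 2$.

```python
import math


def eta_parameters(n, sigma, alpha=1.0):
    """Evaluate eta = log_3(alpha * lg^2 n / lg sigma) and related quantities.

    Returns (eta, 3^eta, 2^eta, 2^eta * lg sigma, lower bound from the proof).
    """
    lg_n = math.log2(n)
    lg_sigma = math.log2(sigma)
    three_pow_eta = alpha * lg_n ** 2 / lg_sigma  # max. characters of an eta-node
    eta = math.log(three_pow_eta, 3)
    two_pow_eta = 2.0 ** eta  # min. characters of an eta-node
    bits = two_pow_eta * lg_sigma  # min. bits covered by an eta-node
    lower_bound = alpha ** 0.6 * lg_n ** 1.2 * lg_sigma ** 0.3
    return eta, three_pow_eta, two_pow_eta, bits, lower_bound
```

For example, with $n = 2^{32}$, $\sigma = 4$ and $\alpha = 1$ an $\eta$-node generates at most $3^\eta = 512$ characters, and the number of bits it covers stays above the lower bound used in the proof.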


Applying Lemma 5.3 to the results elaborated in Section 3.5 for the tETs yields 


Corollary 5.4. We can compute a tHT on a substring of length $\ell$ in $O(\ell \lg^* n + t_\lambda \ell / 2^\eta + \ell \lg \lg n)$ time. The tree takes $O(\ell / 2^\eta)$ space. We need a working space of $O(\lg^2 n \lg^* n / \lg \sigma)$ characters.

Proof. We follow the proof of Lemma 3.4. The tree has at most $\ell / 2^\eta$ nodes, and thus takes $O(\ell / 2^\eta)$ text space. Constructing an $\eta$-node uses $O(3^\eta \lg^* n) = O(\lg^2 n \lg^* n / \lg \sigma)$ characters as working space. □

Figure 14: We apply HSP with our tie breaking rule defined in Section 3.1 to the strings Y (top) and aY (bottom) of Figure 13. Only the fragile nodes of the leftmost meta-blocks on each height may differ.


Corollary 5.5. An LCE query on two tHTs can be answered in $O(\lg^* n \lg n)$ time.


Proof. LCE queries are answered as in Lemma 3.5. The value of $\eta$ is set so that $3^\eta \lg \sigma = \alpha \lg^2 n$ holds. Since an $\eta$-node generates a substring comprising at most $\alpha \lg^2 n / \lg \sigma$ characters, we can check the subtree of an $\eta$-node by examining $\alpha \lg n$ words. Overall, these additional costs are bounded by $O((\Delta_L + \Delta_R) \lg n)$ time, and do not worsen the running time $O(\lg^* n (\lg (n / 2^\eta) + \alpha \lg n)) = O(\lg^* n \lg n)$. □


We analyze the merging when applied by the sparse suffix sorting algorithm in Section 4.1. Assume that our algorithm found two intervals $[i..i+\ell-1]$ and $[j..j+\ell-1]$ with $T[i..i+\ell-1] = T[j..j+\ell-1]$. Ideally, we want to construct tHT$(T[i..i+\ell-1])$ in the text space $[j..j+\ell-1]$, leaving $T[i..i+\ell-1]$ untouched so that parts of this substring can be referenced by the $\eta$-nodes. Unfortunately, there are two situations that make the life of a tHT complicated:

• the need for merging tHTs, and

• possible overlapping of the intervals $[i..i+\ell-1]$ and $[j..j+\ell-1]$.

Partitioning of LCE intervals. In order to merge trees, we have to take special care of those $\eta$-nodes that are fragile, because their names may have to be recomputed during a merge. In order to recompute the name of an $\eta$-node $v$, consisting of a pointer and a length, we have to find a substring that consists of $v$'s generated substring and some adjacent characters with respect to the original substring in the text. That is because the parsing may assign a new pointer and a new length to an $\eta$-node, possibly enlarging the generated substring, or letting the pointer refer to a different substring.

The name for a surrounded fragile $\eta$-node $v$ is easily recomputable thanks to Lemma 5.2. Since the new generated substring of $v$ is a prefix of the generated substring of an already existing $\eta$-node $w$, which is found in the reverse dictionary for $\eta$-nodes, we can create a new name for $v$ from the generated substring of $w$.

Unfortunately, the same approach does not work with the non-surrounded $\eta$-nodes. Those nodes have a generated substring that is found on the border area of $T[j..j+\ell-1]$. If we leave this area untouched, we can use it for creating names of a non-surrounded $\eta$-node during a reparsing. Therefore, we mark those parts of the interval $[j..j+\ell-1]$ as read-only. Conceptually, we partition an LCE interval into subintervals of green and red intervals (see Figure 15); we free the text of a green interval for overwriting, while prohibiting write-access on a red interval. The green intervals are managed in a dynamic, global list. We keep the invariant that


Invariant 1: $f := \lceil 2 \alpha \lg^2 n\, \Delta_L / \lg \sigma \rceil = \Theta(g)$ positions of the left and right ends of each LCE interval are red.


This invariant solves the problem for the non-surrounded nodes. 
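
Invariant 1 can be pictured with a small sketch that splits an LCE interval into its red border areas and the green interior. The function name `partition_lce_interval` and the tuple representation are hypothetical; in particular, treating an interval of at most $2f$ positions as entirely red is an assumption of this sketch, not a statement of the paper.

```python
def partition_lce_interval(b, e, f):
    """Split the inclusive interval [b..e] into red borders and a green interior.

    The f leftmost and f rightmost positions are red (read-only); everything
    in between is green (free for overwriting).
    """
    length = e - b + 1
    if length <= 2 * f:
        return [("red", b, e)]  # assumed handling of short intervals
    return [
        ("red", b, b + f - 1),
        ("green", b + f, e - f),
        ("red", e - f + 1, e),
    ]
```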





Figure 15: The border areas of each LCE interval are marked read-only such that we can reparse non- 
surrounded nodes and merge two trees. 


Allocating Space. We can store the upper part of the tHT in a green interval, since $(\ell / 2^\eta) \lg n \le \ell (\lg \sigma)^{0.7} / (\alpha^{0.6} (\lg n)^{0.2}) = o(\ell \lg \sigma)$ holds. By choosing $g$ and $\alpha$ properly, we can always leave $f \lg \sigma / \lg n = O(\lg^* n \lg n)$ words on a green interval untouched, sufficiently large for the working space needed by Corollary 5.4. Therefore, we pre-compute $\alpha$ and $g$ based on the input T, and set both as global constants. Since the same amount of free space is needed during a later merging when reparsing an $\eta$-node, we add the invariant that


Invariant 2: each LCE interval has $f \lg \sigma / \lg n$ words of free space left on a green interval.


For the merging, we need a more sophisticated approach that respects both invariants: 

Merging. We introduce a merge operation that allows the merge of two tHTs whose LCE intervals have a gap of less than $g$ characters. The merge operation builds new $\eta$-nodes on the gap. The $\eta$-nodes whose generated substrings intersect with the gap are called bridging nodes. The bridging nodes have the same problem as the non-surrounded $\eta$-nodes, since the gap may be a unique substring of T.

Let $I$ and $J$ be two LCE intervals with $0 < b(J) - e(I) < g$, where on each interval a tHT has been computed. We compute tHT$(T[b(I)..e(J)])$ by merging both trees. By Lemma 4.2, at most $O(\Delta_L + \Delta_R)$ nodes at every height on each tree have to be reprocessed, and some bridging nodes connecting both trees have to be built. Unfortunately, the text may not contain another occurrence of $T[e(I)-f..b(J)+f]$ such that we could overwrite $T[e(I)-f..b(J)+f]$. Therefore, we mark this interval as red. So we can use the characters contained in $T[e(I)-f..b(J)+f]$ for creating the bridging $\eta$-nodes, and for modifying the non-surrounded nodes of both trees (Figure 16). Since the gap consists of less than $g$ characters, the bridging nodes need at most $O(\lg n \lg^* n)$ additional space. By choosing $g$ and $\alpha$ sufficiently large, we can maintain Invariant 2 for the merged LCE interval.

Interval Overlapping. Assume that the LCE intervals $[i..i+\ell-1]$ and $[j..j+\ell-1]$ overlap, without loss of generality $j > i$. Our goal is to create tHT$(T[i..i+\ell-1])$. Since $T[i..i+\ell-1] = T[j..j+\ell-1]$, the substring $T[i..j+\ell-1]$ has a period $1 \le d \le j - i$, i.e., $T[i..j+\ell-1] = X^k Y$, where $|X| = d$ and $Y$ is a prefix of $X$, for some $k \ge 2$. First, we compute the smallest period $d \le j - i$ of $T[i..j+\ell-1]$ in $O(\ell)$ time [11]. By definition, each substring of $T[j+f..j+\ell-1]$ appears also $d$ characters earlier. The substring $T[i..i+d+f-1]$ is used as a reference and therefore marked red. Keeping the original characters in $T[i..i+d+f-1]$, we can restore the generated substrings of every $\eta$-node by an arithmetic progression: Assume that the generated substring of an $\eta$-node is $T[b..e]$ with $i+d+f \le b < e \le j+\ell-1$. Since $|[b..e]| \le f$, we find a $k \ge 1$ such that $T[b..e] = T[b-dk..e-dk]$ and $[b-dk..e-dk] \subset [i..i+d+f-1]$. Hence, we can mark the interval $[i+d+f..j+\ell-1-f]$ green. The partitioning into red and green intervals is illustrated in Figure 17.
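
The arithmetic progression argument can be made concrete with a short sketch. It is written under assumptions: positions are 0-based (the paper uses 1-based positions), the smallest period is computed with the classic border (failure function) technique as one possible linear-time method, and the names `smallest_period` and `map_to_reference` are hypothetical.

```python
def smallest_period(s):
    """Smallest period of the non-empty string s via the border array."""
    n = len(s)
    fail = [0] * n  # fail[q] = length of the longest proper border of s[:q+1]
    k = 0
    for q in range(1, n):
        while k > 0 and s[q] != s[k]:
            k = fail[k - 1]
        if s[q] == s[k]:
            k += 1
        fail[q] = k
    return n - fail[-1]


def map_to_reference(b, e, i, d, f):
    """Shift [b..e] left by k*d so that it lies inside [i..i+d+f-1].

    Requires e - b + 1 <= f and b >= i + d + f. Returns (b', e', k).
    """
    ref_end = i + d + f - 1
    k = -(-(e - ref_end) // d)  # ceiling of (e - ref_end) / d
    return b - k * d, e - k * d, k
```

With the period $d$ of the overlapping region, every generated substring to the right of the reference area can thus be read from the red reference $T[i..i+d+f-1]$ instead of its original position.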



Figure 16: The merging is performed only if the gap between both trees is less than $g$. The substring $T[e(I)-f..b(J)+f]$ is marked red for the sake of the bridging nodes.


Finally, the time bound for the above merging strategy is given by 


Corollary 5.6. Given two LCE intervals $I$ and $J$ with $0 < b(J) - e(I) < g$, we can build tHT$(T[b(I)..e(J)])$ in $O(g \lg^* n + t_\lambda g / 2^\eta + g \eta / \log_\sigma n + t_\lambda \lg^* n \lg n)$ time.


Proof. We simulate the algorithm of Lemma 4.2 on a tHT. By Invariant 2, there is enough space left on a green interval to recompute the nodes considered in the proof of Lemma 4.2, and to create the bridging nodes in the fashion of Corollary 5.4. Both creating and recomputing takes overall $O(g \lg^* n + t_\lambda g / 2^\eta + g \eta / \log_\sigma n)$ time. □


There is one problem left before we can prove the main result of the paper: The sparse suffix sorting algorithm of Section 4.1 temporarily creates LCE intervals on substrings smaller than $g$ between two LCE intervals when applying Rule 3. We cannot afford to build such tiny tHTs, since they cannot respect both invariants. Since a temporarily created dynLCE is eventually merged with a dynLCE on a large LCE interval, we do not create a tHT if it covers less than $g$ characters. Instead, we apply the new merge operation of Corollary 5.6 directly, merging two trees that have a gap of less than $g$ characters. With this and the other properties stated above, we come to

Figure 17: Assume that the LCE intervals $I$ and $J$, both of length $\ell$, are overlapping. $T[I \cup J]$ has the smallest period $d$. To avoid the fragmentation of generated substrings, it is sufficient to make $f + d$ characters on the left read-only.


Proof of Theorem 1.1. The analysis is split into suffix comparison, tree generation and tree merging:

• Suffix comparison is done as in the proof of Lemma 4.1. LCE queries on ETs and tHTs are conducted in the same time bounds (compare Lemma 3.3 with Corollary 5.5).

• All positions considered for creating the tHTs belong to $C$. Constructing the tHTs costs at most $O(|C| \lg^* n + t_\lambda |C| / 2^\eta + |C| \lg \lg n)$ overall time, due to Corollary 5.4.

• Merging in the fashion of Corollary 5.6 does not affect the overall time: Since a merge of two trees introduces less than $g$ new text positions to an LCE interval, it follows from the proof of Theorem 4.3 that the time for merging is upper bounded by the construction time.

By Lemma 4.1, the time for generating and merging the trees is bounded by

\[ O(|C| \lg^* n + t_\lambda |C| / 2^\eta + |C| \lg \lg n) = O\left(|C| \left(t_\lambda (\lg \sigma)^{0.7} / (\lg n)^{1.2} + \lg \lg n\right)\right) = O(|C| \sqrt{\lg n}), \]

since $t_\lambda \in O(\lg n)$. The time for searching and sorting is $O(m \lg m \lg^* n \lg n)$. The external data structures used are SAVL(Suf(V)) and the search tree for the LCE intervals, each taking $O(m)$ space. □